Introduction

Association analysis in microbiome studies infers relationship between OTUs/ taxa. Pearson r Correlation and Spearman’s Rank Correlation are classical methods in association analysis. However, microbiome relative data is compositional and consequently make the classical methods inappropriate. If naively applying classical methods to compositional microbiome data, we likely end up with getting spurious correlations.

The review article published in 2017, “Microbiome datasets are compositional: and this is not optional” (Gloor et al. 2017)summarized four newly-invented statistical methods dealing with compositionality and achieving reliable association relationship. The four methods consists of SparCC, Spiec Easi, proportionality (ϕ and ρ).

Proportionality is a precise indicator of absolute correlation, although sensitivity is limited (Thomas P. Quinn et al. 2017).

In this project, I intend to demonstrate analysis over the HMP-IBD microbiome dataset using the above four statistical methods; and to compare outcomes based on the four methods against ones based on Spearman’s Rank Correlation.

R workflow (use proportionality analysis with propr package as an example)

  1. Load dataset
  2. Calculate proportionality
  3. Identify proportionally abundant taxa
  4. Visualizing proportionality
  5. Differential proportionality analysis

Scripts

library(tidyverse) 

library(devtools)
## Warning: package 'devtools' was built under R version 4.2.2
## Warning: package 'usethis' was built under R version 4.2.2
install_github("tpq/propr")  
library(propr)

library(stringr)

Step 1. Load dataset

The microbiome OTU table and metadata was retrieved from ML Repo

# raw OTU table
raw_otu <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/refseq/otutable.txt",
                    header=T,
                    sep = "")

The raw OTU table contains 557 features (OTUs) in 281 samples.

Shorten OTU identifiers with keeping only genus and species names, using stringr package. And then accumulate counts of OTUs at the species level.

head(raw_otu$X.OTU) 
## [1] "NR_147401.1_Sutterella_massiliensis_strain_Marseille-P2435_16S_ribosomal_RNA__partial_sequence"    
## [2] "NR_025930.1_Ruminococcus_bromii_strain_ATCC_27255_16S_ribosomal_RNA_gene__partial_sequence"        
## [3] "NR_025230.1_Megasphaera_micronuciformis_strain_AIP_412.00_16S_ribosomal_RNA_gene__partial_sequence"
## [4] "NR_024994.1_Lactobacillus_mucosae_strain_S32_16S_ribosomal_RNA_gene__complete_sequence"            
## [5] "NR_024750.1_Coprococcus_catus_strain_VPI-C6-61_16S_ribosomal_RNA_gene__partial_sequence"           
## [6] "NR_024684.1_Eubacterium_nitritogenes_strain_JCM_6485_16S_ribosomal_RNA_gene__partial_sequence"
# split strings 
otu.id <- raw_otu$X.OTU

split_otu.id <- str_split_fixed(otu.id, "_", n=6) %>% 
        as.data.frame()

# join genus and species fragments 
join_split_otu.id <- split_otu.id %>% 
        mutate(taxa = str_c(V3, V4, sep = " ")) 

taxa <- join_split_otu.id$taxa 

# merge taxa to raw otu table 
raw_otu_taxa <- cbind(raw_otu, taxa) 


# accumulate counts at the species level 
otu_count_splevel <- raw_otu_taxa %>% 
        gather(key = "sample", value = "count", -c("ID", "X.OTU", "taxa")) %>% 
        group_by(sample, taxa) %>%
        summarise(count_sp = sum(count, na.rm = T)) %>% 
        arrange(desc(count_sp)) %>%
        ungroup()

Read in the metadata from URL.

# metadata 

meta <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/task-obese-lean-all.txt",
         header = TRUE,
         sep = "") %>% 
        mutate(sample = X.SampleID, is.obese = Var) %>% 
        select(sample, is.obese)

head(meta)

The metadata contains 142 samples. Var indicates the independent, binary variable of interest, Lean and Obese.

Transpose the OTU table into sample x taxa format which is suitable for proportionality analysis.

otu_count_splevel2 <- otu_count_splevel %>% 
        spread(key = "taxa", value = count_sp) %>%
        column_to_rownames("sample")

Now, OTU count table at the species level contains 517 features, in 281 samples.

Filter low abundant taxa before the proportionality analysis. Keep only the taxa with at least 10 counts in at least 5 samples.

keep_index <- apply(otu_count_splevel2, 2, function(x) sum(x >= 10) >= 5)  

The resulting OTU table after filtering the low abundant taxa contains 154 unique species/taxa.

Add group variables to the resulting OTU table.

otu_count_splevel2_meta <- otu_count_splevel2 %>%
        rownames_to_column("sample") %>%
        inner_join(meta, by= "sample") %>%
        column_to_rownames("sample") 

head(otu_count_splevel2_meta)

Check if any missing values in the OTU table. Replace missing values with zero if any.

sum(is.na(otu_count_splevel2)) 
## [1] 0
# otu_count_splevel2_rmna <- otu_count_splevel2 %>%
#         replace(is.na(.), 0)
# 
# sum(is.na(otu_count_splevel2_rmna))

Step 2. Calculate proportionality using propr R package

Calculate a proportionality measurement, rho using propr function, following the procedures and codes provided by T.P.Quinn et al. (Thomas P. Quinn et al. 2019)

Set three required arguments in the wrapper propr function as follows,

  • counts a count matrix with subjects as rows and features as columns

  • metric the proportionality metric to calculate, ‘rho’, ‘pho’ or ‘phs’

  • ivar reference OTU/ feature

pr <- propr(counts = otu_count_splevel2, # a count matrix with subjects as rows and features as columns
            metric = "rho", # the proportionality metric to caculate, 'rho', 'pho' or 'phs'
            ivar = "clr", # reference OTU/ feature for alr
            symmetrize = FALSE,
            p = 100, # permutation cycles (100, by default)  
            select = keep_index) # filter low-abundant taxa 


## alternatively, calculate phi, rho or phs using phit(), perb() and phis(), respectively. 
# phi <- phit(raw_otu_t_na, symmetrize = TRUE) 
# 
# rho <- perb(raw_otu_t_na, ivar = 0)
# 
# phs <- phis(raw_otu_t_na, ivar = 0) 

For proportionality, permute false discovery rate (FDR) for a given cut-off, instead of estimating parametric P-value.

# select a good cutoff for 'rho' by permuting the FDR at various cutoffs. 
# below, use [0, .05, ..., .95, 1]

# pr <- updateCutoffs(pr, cutoff = seq(0,1,.05))   
# 
# pr@fdr

According to the package vignette, choose the largest cutoff that keeps the FDR below 0.05. In this case, rho cutoff is 0.20.

Visualize proportionality with a strict cutoff, using getNetwork function. The package vignette describes several built-in tools for visualizing proportionality.

getNetwork(pr, cutoff = 0.2) 

## IGRAPH b2b88e6 UN-- 150 969 -- 
## + attr: name (v/c), color (v/c), size (v/n), label (v/l), color (e/c)
## + edges from b2b88e6 (vertex names):
## [1] Pseudoflavonifractor phocaeensis--Anaeromassilibacillus senegalensis
## [2] Pseudoflavonifractor phocaeensis--Parabacteroides distasonis        
## [3] Pseudoflavonifractor phocaeensis--Flavonifractor plautii            
## [4] Pseudoflavonifractor phocaeensis--Flintibacter butyricus            
## [5] Pseudoflavonifractor phocaeensis--Drancourtella massiliensis        
## [6] Pseudoflavonifractor phocaeensis--{Ruminococcus} torques            
## [7] Pseudoflavonifractor phocaeensis--{Clostridium} xylanolyticum       
## [8] Pseudoflavonifractor phocaeensis--Frisingicoccus caecimuris         
## + ... omitted several edges

Export highly-proportional (rho >= 0.9) features (OTU).

result_rho_df <- getResults(pr, cutoff = 0.2) 

result_rho_df 

How to interpret proportionality? Proportionality depends on log-transformation and must get interpreted with respect to the chosen reference.

Interpret clr-based proportionality to signify a coordination that follows the general trend of the data. In other words, these proportional OTUs move together as individuals relative to how most genes move on average (Thomas P. Quinn et al. 2019).

Differential proportionality analysis

propd functions in the propr package can test whether proportionality of certain pairs change between binary experimental groups.

According to the package vignette, propd method compares the variance of log-ratio (VLR) for one pair across groups, considering two types of differential proportionality,

  • disjointed proportionality: proportionality of a pair holds in both groups, but the ratio between the partners changes between the groups (i.e., the slope of the proportionality changes)

  • emergent proportionality: proportionality exits in only one of the groups (i.e., the strength of the proportionality changes)

propd function estimates differential proportionality by calculating \(theta\) for all features pairs.

The function takes the following arguments as input:

  • counts a matrix of \(n\) samples (as rows) and \(d\) features (as columns)

  • group an \(n\)-dimensional vector corresponding to subject labels

  • alpha an optional argument to trigger and guide transformation

  • p the total number of permutations used to estimate FDR

counts <- otu_count_splevel2_meta %>% select(-is.obese)

group <- otu_count_splevel2_meta$is.obese 

pd <- propd(counts, group, alpha = NA, p=100) 
theta_d <- setDisjointed(pd)  # disjointed diff prop 

theta_e <- setDisjointed(pd)  # emergent diff prop 
theta_d_df <- getResults(theta_d) 

theta_d_df %>% 
  arrange(theta) 
theta_e_df <- getResults(theta_e) 

theta_e_df %>% 
  arrange(theta_e)
plot(theta_d@counts[, "Odoribacter splanchnicus"], 
     theta_d@counts[, "Bacteroides massiliensis"], 
     col = ifelse(theta_d@group == "Obese", "red", "blue"))
grp1 <- theta_d@group == "Obese"
grp2 <- theta_d@group != "Obese"
abline(a = 0, 
       b = theta_d@counts[grp1, "Odoribacter splanchnicus"] / theta_d@counts[grp1, "Bacteroides massiliensis"], 
       col = "red")
abline(a = 0, 
       b = theta_d@counts[grp2, "Odoribacter splanchnicus"] / theta_d@counts[grp2, "Bacteroides massiliensis"], 
       col = "blue")

plot(theta_d@counts[, "Odoribacter splanchnicus"] / theta_d@counts[, "Bacteroides massiliensis"],
     col = ifelse(theta_d@group == "Obese", "red", "blue"))

visualize network based on differential proportionality

g <- plot(theta_d, 
          cutoff=300, 
          d3= F)  # TRUE for displaying 3-D plot 

plot(theta_e, cutoff=200, d3=T) 
## IGRAPH b724f48 UN-- 128 200 -- 
## + attr: name (v/c), color (v/c), size (v/n), label (v/l), color (e/c)
## + edges from b724f48 (vertex names):
##  [1] {Clostridium} lavalense--{Clostridium} oroticum      
##  [2] {Clostridium} lavalense--Eubacterium ruminantium     
##  [3] {Clostridium} lavalense--{Clostridium} celerecrescens
##  [4] {Clostridium} oroticum --{Clostridium} spiroforme    
##  [5] {Clostridium} oroticum --Blautia faecis              
##  [6] {Clostridium} oroticum --Citrobacter freundii        
##  [7] {Clostridium} oroticum --Holdemania filiformis       
##  [8] {Clostridium} oroticum --Neglecta timonensis         
## + ... omitted several edges
Gloor, Gregory B., Jean M. Macklaim, Vera Pawlowsky-Glahn, and Juan J. Egozcue. 2017. “Microbiome Datasets Are Compositional: And This Is Not Optional.” Frontiers in Microbiology 8 (November). https://doi.org/10.3389/fmicb.2017.02224.
Quinn, Thomas P, Ionas Erb, Greg Gloor, Cedric Notredame, Mark F Richardson, and Tamsyn M Crowley. 2019. “A Field Guide for the Compositional Analysis of Any-Omics Data.” GigaScience 8 (9). https://doi.org/10.1093/gigascience/giz107.
Quinn, Thomas P., Mark F. Richardson, David Lovell, and Tamsyn M. Crowley. 2017. “Propr: An R-Package for Identifying Proportionally Abundant Features Using Compositional Data Analysis.” Scientific Reports 7 (1). https://doi.org/10.1038/s41598-017-16520-0.